This document is the summary of the R for Data Analysis workshop.

All correspondence related to this document should be addressed to:

Omid Ghasemi (Macquarie University, Sydney, NSW, 2109, AUSTRALIA)

Email:

1 Introduction to R

1.1 Basics and Variables

R can be used as a calculator. For mathematical purposes, be careful of the order in which R executes the commands.

10 + 10
## [1] 20
4 ^ 2
## [1] 16
(250 / 500) * 100
## [1] 50

R is a bit flexible with spacing (but no spacing in the name of variables and words)

10+10
## [1] 20
10                 +           10
## [1] 20

R can sometimes tell that you’re not finished yet

10 +

How to create a variable? Variable assignment using <- and =. Note that R is case sensitive for everything

pay <- 250

month = 12

pay * month
## [1] 3000
salary <- pay * month

Few points in naming variables and vectors: use short, informative words, keep same method (e.g., you can use capital letters but it is not recommended, use only _ or . ).

1.2 Function

Function is a set of statements combined together to perform a specific task. When we use a block of code repeatedly, we can convert it to a function. To write a function, first, you need to define it:

my_multiplier <- function(a,b){
  result = a * b
  return (result)
}

This code do nothing. To get a result, you need to call it:

my_multiplier (a=2, b=4)
## [1] 8
# or: my_multiplier (2, 4)

We can set a default value for our arguments:

my_multiplier2 <- function(a,b=4){
  result = a * b
  return (result)
}

my_multiplier2 (a=2)
## [1] 8
# or: my_multiplier (2)
# or: my_multiplier (2, 6)

Fortunately, you do not need to write everything from scratch. R has lots of built-in functions that you can use:

round(54.6787)
## [1] 55
round(54.5787, digits = 2)
## [1] 54.58

Use ? before the function name to get some help. For example, ?round. You will see many functions in the rest of the workshop.

1.3 Data Types

function class() is used to show what is the type of a variable.

  1. Logical: TRUE, FALSE can be abbreviated as T, F. They has to be capital, ‘true’ is not a logical data:
class(TRUE)
## [1] "logical"
class(F)
## [1] "logical"
  1. Numeric: all numbers e.g. 5, 10.5, 11,37; a special type of numeric is “integer” which is numbers without decimal. Integers are always numeric, but numeric is not always integer:
class(2)
## [1] "numeric"
class(13.46)
## [1] "numeric"
  1. Character: text for example, “I love R” or “4” or “4.5”:
class("ha ha ha ha")
## [1] "character"
class("56.6")
## [1] "character"
class("TRUE")
## [1] "character"

Can we change the type of data in a variable? Yes, you need to use the function as.---()

as.numeric(TRUE)
## [1] 1
as.character(4)
## [1] "4"
as.numeric("4.5")
## [1] 4.5
as.numeric("Hello")
## Warning: NAs introduced by coercion
## [1] NA

1.4 Data Structures

1.4.1 Vector

When there are more than one number or letter stored. Use the combine function c() for that.

sale <- c(1, 2, 3,4, 5, 6, 7, 8, 9, 10) # also sale <- c(1:10)

sale <- c(1:10)

sale * sale
##  [1]   1   4   9  16  25  36  49  64  81 100

Subsetting a vector:

days <- c("Saturday", "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

days[2]
## [1] "Sunday"
days[-2]
## [1] "Saturday"  "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"
days[c(2, 3, 4)]
## [1] "Sunday"  "Monday"  "Tuesday"
  • Exercise: Create a vector named my_vector with numbers from 0 to 1000 in it and calculate mean, median, sd, min, max, and sum of that vector:
my_vector <- (0:1000)

mean(my_vector)
## [1] 500
median(my_vector)
## [1] 500
min(my_vector)
## [1] 0
range(my_vector)
## [1]    0 1000
class(my_vector)
## [1] "integer"
sum(my_vector)
## [1] 500500
sd(my_vector)
## [1] 289.1081

1.4.2 List

List allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, even other list.

my_list = list(sale, 1, 3, 4:7, "HELLO", "hello", FALSE)
my_list
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## [[2]]
## [1] 1
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 4 5 6 7
## 
## [[5]]
## [1] "HELLO"
## 
## [[6]]
## [1] "hello"
## 
## [[7]]
## [1] FALSE

1.4.3 Factor

Factors store the vector along with the distinct values of the elements in the vector as labels. The labels are always character irrespective of whether it is numeric or character. For example, variable gender with “male” and “female” entries:

gender <- c("male", "male", "male", " female", "female", "female")
gender <- factor(gender)

R now treats gender as a nominal (categorical) variable: 1=female, 2=male internally (alphabetically).

summary(gender)
##  female  female    male 
##       1       2       3
  • Question: why when we ran the above function i.e. summary(), it showed three and not two levels of the data? Hint: run ‘gender’.
gender
## [1] male    male    male     female female  female 
## Levels:  female female male

So, be careful of spaces!

  • Exercise: Create a gender factor with 30 male and 40 females (Hint: use the rep() function):
gender <- c(rep("male",30), rep("female", 40))
gender <- factor(gender)
gender
##  [1] male   male   male   male   male   male   male   male   male   male  
## [11] male   male   male   male   male   male   male   male   male   male  
## [21] male   male   male   male   male   male   male   male   male   male  
## [31] female female female female female female female female female female
## [41] female female female female female female female female female female
## [51] female female female female female female female female female female
## [61] female female female female female female female female female female
## Levels: female male

There are two types of categorical variables: nominal and ordinal. How to create ordered factors (when the variable is nominal and values can be ordered)? We should add two additional arguments to the factor() function: ordered = TRUE, and levels = c("level1", "level2"). For example, we have a vector that shows participants’ education level.

edu<-c(3,2,3,4,1,2,2,3,4)

education<-factor(edu, ordered = TRUE)
levels(education) <- c("Primary school","high school","College","Uni graduated")
education
## [1] College        high school    College        Uni graduated  Primary school
## [6] high school    high school    College        Uni graduated 
## Levels: Primary school < high school < College < Uni graduated
  • Exercise: We have a factor with patient and control values. Here, the first level is control and the second level is patient. Change the order of levels, so patient would be the first level:
health_status <- factor(c(rep('patient',5),rep('control',5)))
health_status
##  [1] patient patient patient patient patient control control control control
## [10] control
## Levels: control patient
health_status_reordered <- factor(health_status, levels = c('patient','control'))
health_status_reordered
##  [1] patient patient patient patient patient control control control control
## [10] control
## Levels: patient control

Finally, can you relabel both levels to uppercase characters? (Hint: check ?factor)

health_status_relabeled <- factor(health_status, levels = c('patient','control'), labels = c('Patient','Control'))
health_status_relabeled
##  [1] Patient Patient Patient Patient Patient Control Control Control Control
## [10] Control
## Levels: Patient Control

1.4.4 Matrices

All columns in a matrix must have the same mode(numeric, character, etc.) and the same length. It can be created using a vector input to the matrix function.

my_matrix = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, ncol = 3)

my_matrix
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

1.4.5 Data frames

Data frames can hold numeric, character or logical values. Within a column all elements have the same data type, but different columns can be of different data type. Let’s create a dataframe:

id <- 1:200
group <- c(rep("Psychotherapy", 100), rep("Medication", 100))
response <- c(rnorm(100, mean = 30, sd = 5),
             rnorm(100, mean = 25, sd = 5))

my_dataframe <-data.frame(Patient = id,
                          Treatment = group,
                          Response = response)

We also could have done the below

my_dataframe <-data.frame(Patient = c(1:200),
                          Treatment = c(rep("Psychotherapy", 100), rep("Medication", 100)),
                          Response = c(rnorm(100, mean = 30, sd = 5),
                                       rnorm(100, mean = 25, sd = 5)))

In large data sets, the function head() enables you to show the first observations of a data frames. Similarly, the function tail() prints out the last observations in your data set.

head(my_dataframe) 
tail(my_dataframe)
Patient Treatment Response
1 1 Psychotherapy 35.59862
2 2 Psychotherapy 36.62962
3 3 Psychotherapy 29.01446
4 4 Psychotherapy 22.71264
5 5 Psychotherapy 33.70909
6 6 Psychotherapy 35.58150
Patient Treatment Response
195 195 Medication 19.73908
196 196 Medication 23.20115
197 197 Medication 24.25496
198 198 Medication 31.60629
199 199 Medication 14.19786
200 200 Medication 33.20383

Similar to vectors and matrices, brackets [] are used to selects data from rows and columns in data.frames:

my_dataframe[35, 3]
## [1] 39.75112
  • Exercise: How can we get all columns, but only for the first 10 participants?
my_dataframe[1:10, ]
Patient Treatment Response
1 Psychotherapy 35.59862
2 Psychotherapy 36.62962
3 Psychotherapy 29.01446
4 Psychotherapy 22.71264
5 Psychotherapy 33.70909
6 Psychotherapy 35.58150
7 Psychotherapy 26.91184
8 Psychotherapy 24.35061
9 Psychotherapy 30.09706
10 Psychotherapy 28.20006

How to get only the Response column for all participants?

my_dataframe[ , 3]
##   [1] 35.59862 36.62962 29.01446 22.71264 33.70909 35.58150 26.91184 24.35061
##   [9] 30.09706 28.20006 32.09037 25.15059 25.23804 35.73875 42.61423 32.24317
##  [17] 24.60450 33.02065 36.51803 33.96189 43.34802 31.97126 31.66812 28.79656
##  [25] 25.97128 33.38626 30.57732 23.27546 19.68985 28.17951 29.98973 27.16092
##  [33] 31.99661 26.27210 39.75112 27.47717 16.54230 21.75016 33.36571 36.44900
##  [41] 30.70574 29.09672 29.43279 22.77227 20.96891 36.73037 28.83073 31.67700
##  [49] 34.54895 28.85963 27.09316 36.16230 24.35933 23.88968 34.71001 24.40981
##  [57] 32.97373 33.39629 18.96927 35.96424 27.82566 21.47080 30.61677 32.20519
##  [65] 35.46101 26.53701 39.78074 30.75453 20.34907 30.20975 36.76597 32.91871
##  [73] 25.03069 35.65879 27.68528 36.40671 27.88243 28.90775 25.83936 41.69814
##  [81] 44.44514 27.74542 28.13034 23.14361 20.74980 31.84590 36.95674 32.19458
##  [89] 27.70576 34.45078 31.68835 29.76735 30.29011 30.04594 28.00025 34.60662
##  [97] 25.64952 31.98999 29.15195 28.50349 24.16872 19.86797 19.86773 28.39305
## [105] 22.58449 21.74178 30.88721 21.84749 28.48663 22.40085 35.49097 28.02832
## [113] 31.35924 23.76777 21.67946 14.10546 22.86193 27.90598 24.61302 23.26970
## [121] 15.32896 30.49925 21.78486 23.99288 25.47913 34.21774 24.12370 26.81657
## [129] 29.65434 18.07262 28.98266 25.36519 31.33496 15.23519 24.89588 17.40844
## [137] 35.31352 19.03414 30.53285 20.96974 36.13117 25.31783 24.70286 24.30349
## [145] 23.77750 25.06979 16.42690 29.35791 34.99587 16.98754 26.94997 28.67525
## [153] 22.93114 18.99034 13.04770 27.22615 22.70794 27.14391 15.72608 28.32175
## [161] 17.88373 31.00352 20.02655 31.80458 29.67562 23.72706 30.19105 25.62265
## [169] 15.54247 19.61728 25.53547 28.17935 30.10152 35.65953 16.25308 19.49600
## [177] 23.54306 22.12626 22.47110 27.36396 28.10063 27.52071 27.52407 27.88606
## [185] 16.75499 29.03299 24.35035 30.20982 27.80097 18.65039 24.78246 24.88516
## [193] 26.85609 33.23869 19.73908 23.20115 24.25496 31.60629 14.19786 33.20383

Another easier way for selecting particular items is using their names that is more helpful than number of the rows in large data sets:

my_dataframe[ , "Response"]
# OR:
my_dataframe$Response

So far, we created dataframes using data.frame function from the base R. However, a better way to create dataframes is to use the tibble function from tidyverse (see here).

2 Data Cleaning

Now, suppose we ran an experiment with 141 depressed patients. Participants were randomly assigned into two treatment groups: CBT or Psychodynamic psychotherapy. We measured self-report depression scores at 5 different stages of treatment:

  • Stage 1: Before starting any treatment. It is our base stage (pre-test)
  • Stage 2: After 5 sessions of psychotherapy (post-test1)
  • Stage 3: After 10 sessions of psychotherapy (post-test2)
  • Stage 4: At the end of the treatment (post-test3)
  • Stage 5: Three months after the treatment (post-test4)

let’s read and check the uncleaned data. But, first thing first. let’s install and then load the tidyvese package. We also need some other packages:

# Install it
install.packages("tidyverse")

# And then load it
library(tidyverse)

# Load other packages that you have already installed
library(here)
library(janitor)
library(broom)
library(afex)
library(emmeans)
library(knitr)
library(kableExtra)
library(ggsci)
library(patchwork)
library(skimr)
# install.packages("devtools")
# devtools::install_github("easystats/correlation")
library("correlation")
options(scipen=999) # turn off scientific notations
options(contrasts = c('contr.sum','contr.poly')) # set the contrast sum globally 
options(knitr.kable.NA = '')
# read the raw data
raw_data <- read_csv(here("raw_data","raw_data_exp1.csv"))
head(raw_data)
progress subject response_id consent_form age gender stage1_cbt stage2_cbt stage3_cbt stage4_cbt stage5_cbt stage1_dynamic stage2_dynamic stage3_dynamic stage4_dynamic stage5_dynamic anxiety1 anxiety2 anxiety3 anxiety4 anxiety5 anxiety6 anxiety7 anxiety8 group sleep_quality life_satisfaction
100 subj1 R_1f298znjmVzcOjp I consent 18 Female 90 31 33 47 50 5 5 5 5 5 6 5 3 Psychodynamic 9 9
100 subj2 R_tL0A9P33Gi18I0N I consent 18 Male 78 46 46 11 13 6 6 6 5 6 6 5 6 CBT 9 10
100 subj3 R_1LNyJhCKxTAAMOW I consent 19 Female 68 51 24 41 24 6 5 4 5 5 6 5 6 CBT 10 8
100 subj4 R_3enxzUsEYgs5r1a I consent 27 Female 100 21 11 6 31 6 6 6 1 6 6 6 1 Psychodynamic 8 7
100 subj5 R_2Qzl2096a4KNE29 I consent 19 Male 30 28 16 6 6 6 6 5 5 6 6 6 6 CBT 11 11
100 subj6 R_esb71WOTQySjusF I consent 20 Female 79 1 57 46 57 6 6 6 5 6 6 6 6 Psychodynamic 10 10
  • Exercise: There is a dataset in the cleaned_data folder named unicef_u5mr.csv. Read the dataset using read_csv and here.
unicef_data <- read_csv(here("cleaned_data","unicef_u5mr.csv"))

In order to clean the data, we use tidyverse which is a collection of packages to work with data. One of the tidyverse packages that we use regularly is dplyr which includes several functions:

  • mutate() adds new variables or change existing ones.
  • select() pick variables (columns) based on their names.
  • filter() picks cases (rows) based on their values.
  • summarise() gives a single single summary of the data (e.g., mean, counts, etc.)
  • arrange() changes the ordering of the rows.
  • group_by() divides your dataframe into grouped dataframes and allow you to do each of the above operations (except for arrange) on every one of them separately.

2.1 Select

Pick subject, age, and gender columns:

selected_data <- select(raw_data, subject, age, gender)

2.2 Filter

Now, do the following tasks: pick all the male participants, pick all the male participants or those greater than 25 years old, and finally pick all male participants and those greater than 25 years old:

# filter all males
fil_male <- filter(raw_data, gender == "Male")
# filter males and older than 25
fil_male_and_g25 <- filter(raw_data, gender == "Male" & age > 25 )
# filter males or older than 25
fil_male_or_g25 <- filter(raw_data, gender == "Male" | age > 25 )

2.3 Arrange

Arrange (order) your dataframe based on the age, once in an ascending order (youngers first) and once based on descending order (olders first):

# order participants based on their age
arranged_data <- arrange(raw_data, age)
# order participants based on their age (descendeing)
arranged_descending <- arrange(raw_data, desc(age))

2.4 Mutate

Create a column to show if the participant has finished the task or not:

mutated_data <- mutate (raw_data, finished= case_when(progress==100~ "Yes",T~ "No"))

2.5 Summarise

Summarize participants age and sd:

summarise(raw_data, mean= mean(age, na.rm=T),
          sd= sd (age, na.rm=T))
mean sd
21.27273 6.635655

2.6 Pipe Operators

A new function: pipe operators %>% pipes a value into the next function:

raw_data %>% 
  summarise(., mean= mean(age, na.rm=T),
            sd= sd (age, na.rm=T))
mean sd
21.27273 6.635655
raw_data %>% 
  summarise(mean= mean(age, na.rm=T),
            sd= sd (age, na.rm=T))
mean sd
21.27273 6.635655

Calculate the age mean of younger than 25 participants only:

raw_data %>% 
  filter (age < 25) %>%
  summarise(mean= mean(age, na.rm=T),
            sd= sd (age, na.rm=T))
mean sd
19.1913 1.515393

2.7 Group By

Calculate the age mean of younger than 25 participants for each gender separately:

raw_data %>% 
  filter (age < 25) %>%
  group_by(gender) %>%
  summarise(mean= mean(age, na.rm=T),
            sd= sd (age, na.rm=T)) %>%
  ungroup ()
gender mean sd
Female 19.21053 1.556693
Male 19.10000 1.333772
  • Exercise: Create a column to show if participant is older than 23 or not and then calculate sleep quality (sleep_quality) mean for each group separately:
raw_data %>%
  mutate(age_group = case_when(age > 23 ~ "greater than 23", T~ "younger than 23")) %>%
  group_by(age_group) %>%
  summarise(sleep_quality = mean(sleep_quality, na.rm=T))
age_group sleep_quality
greater than 23 9.000000
younger than 23 8.107438
  • Exercise: Add the anxiety total score (sum) to the dataframe and then convert subject column to factor:
anxiety_data <- raw_data %>%
  mutate(anxiety_total= anxiety1+anxiety2+anxiety3+anxiety4+anxiety5+anxiety6+anxiety7+anxiety8) %>%
  mutate(subject= factor(subject))

2.8 Pivoting

Next, we want to pivot our data to switch between long and wide format:

# Make you data long
long_data <- raw_data %>%
  select(subject, stage1_cbt:stage5_cbt,stage1_dynamic:stage5_dynamic) %>%
  pivot_longer(cols = c(stage1_cbt:stage5_dynamic), names_to = 'stage', values_to = 'depression_score')

# Make you data wide
wide_data <- long_data %>%
  pivot_wider(names_from = stage, values_from= depression_score)
  • Exercise: Convert the UNICEF dataset to long and wide formats:
unicef_data <- read_csv(here("cleaned_data","unicef_u5mr.csv"))

library(janitor)
unicef_data_cleaned <- unicef_data %>%
  clean_names()

unicef_long_data <- unicef_data_cleaned %>% pivot_longer(cols = c(u5mr_1950:u5mr_2015), names_to = 'year', values_to = 'u5mr')
unicef_wideg_data <- unicef_long_data %>% pivot_wider(names_from = 'year', values_from = 'u5mr')

Note: The codes for the previous exercise were taken from this blog post written by Simon Ejdemyr.

Now, let’s do some cleaning using dplyr, tidyr and other tidyverse libraries:

cleaned_data <- raw_data %>% 
  filter(progress == 100) %>% # filter out unfinished participants
  select(-consent_form) %>% #remove some useless columns
  # create a total score for our questionnaire
  mutate(anxiety_total= anxiety1+anxiety2+anxiety3+anxiety4+anxiety5+anxiety6+anxiety7+anxiety8) %>%
  select(-anxiety1:-anxiety8) %>%
  # make our dataframe long
  pivot_longer(cols = c(stage1_cbt:stage5_cbt,stage1_dynamic:stage5_dynamic),names_to = 'stage',values_to = 'depression_score') %>% 
  #pivot_wider(names_from = stage, values_from= depression_score) # this code change our dataframe back to wide
  filter(!is.na(depression_score)) %>% #remove rows with depression_score == NA
  mutate(stage= gsub("_.*", "", stage)) %>%
  select (subject, age, gender, group, stage, depression_score, anxiety_total, sleep_quality, life_satisfaction)
subject age gender group stage depression_score anxiety_total sleep_quality life_satisfaction
subj1 18 Female Psychodynamic stage1 90 39 9 9
subj1 18 Female Psychodynamic stage2 31 39 9 9
subj1 18 Female Psychodynamic stage3 33 39 9 9
subj1 18 Female Psychodynamic stage4 47 39 9 9
subj1 18 Female Psychodynamic stage5 50 39 9 9
subj2 18 Male CBT stage1 78 46 9 10

Ok, now the data is clean and tidy which means:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table (Wickham, 2014).

Check the dataframe and all the data types:

str(cleaned_data)
## tibble [655 × 9] (S3: tbl_df/tbl/data.frame)
##  $ subject          : chr [1:655] "subj1" "subj1" "subj1" "subj1" ...
##  $ age              : num [1:655] 18 18 18 18 18 18 18 18 18 18 ...
##  $ gender           : chr [1:655] "Female" "Female" "Female" "Female" ...
##  $ group            : chr [1:655] "Psychodynamic" "Psychodynamic" "Psychodynamic" "Psychodynamic" ...
##  $ stage            : chr [1:655] "stage1" "stage2" "stage3" "stage4" ...
##  $ depression_score : num [1:655] 90 31 33 47 50 78 46 46 11 13 ...
##  $ anxiety_total    : num [1:655] 39 39 39 39 39 46 46 46 46 46 ...
##  $ sleep_quality    : num [1:655] 9 9 9 9 9 9 9 9 9 9 ...
##  $ life_satisfaction: num [1:655] 9 9 9 9 9 10 10 10 10 10 ...

Finally, we save our data to the cleaned_data folder.

write_csv(cleaned_data, here("cleaned_data","cleaned_data_exp1.csv"))

3 Data Visualization

Before starting the ggplot, let’s try a visualization using a function from the Base R the plot() function shows the association of each variable against the other one in a data handy for data with few number of variables to see if there are any patterns

exam_data<- read_csv(here::here("cleaned_data", "exam_data.csv"))

plot(x = exam_data$Anxiety, y = exam_data$Exam)

The code also works without writing x and y, however, writing them is strongly recommended

plot(exam_data$Anxiety, exam_data$Exam)

ggplot, the gg in ggplot stands for grammar of graphics. Grammar of graphics basically says any graphical representation of data, can be produced by a series of layers. You can think of a layer as a plastic transparency. Lets draw the same plot using ggplot. Always, mention the data we are going to work with.

ggplot(data = exam_data, aes(x = Exam, y = Anxiety))

  • aes: aes which stands for aesthetics is a relationship between a variable in your dataset and an aspect of the plot that is going to visually convey the information to the reader

  • Visual elements are known as geoms (short for ‘geometric objects’) in ggplot 2. When we define a layer, we have to tell R what geom we want displayed on that layer (do we want a bar, line dot, etc.?)

ggplot(data = exam_data, aes(x = Exam, y = Anxiety))+ geom_point()

So, lets try some of them here like shape and size. Be careful with the + sign, if you clink enter for the next part of the code, the + sign should not go to the next line

ggplot(data = exam_data, aes(x = Exam, y = Anxiety))+
  geom_point(size = 2, shape = 8)

The current plot is not very informative about the patterns for each gender.

ggplot(data = exam_data, aes(x = Exam, y = Anxiety, color = Gender))+
  geom_point(size = 2, shape = 10)

ggplot(data = exam_data, aes(x = Exam, y = Anxiety, color = Gender, shape = Gender))+
  geom_point(size = 2, shape = 10)

Question: why the above code doesn’t make any change?

ggplot(data = exam_data, aes(x = Exam, y = Anxiety, color = Gender, shape = Gender))+
  geom_point(size = 2)

Can assign the first layer to a variable to reduce the length of codes for next layers.

My_graph <- ggplot(data = exam_data, aes(x = Exam, y = Anxiety))

My_graph + geom_point()

lets add a line to the current graph

My_graph + geom_point() + geom_smooth()

Aesthetics can be set for all layers of the plot (i.e., defined in the plot as a whole) or can be set individually for each geom in a plot.

My_graph + geom_point(aes(color = Gender)) + geom_smooth()

My_graph + geom_point(aes(color = Gender)) + geom_smooth(aes(color = Gender))

The shaded area around the line is the 95% confidence interval around the line. We can switch this off by adding se = F (which is short for ‘standard error = False’)

My_graph + geom_point() + geom_smooth(se = F)

What if we want our line to be a direct line?

My_graph + geom_point() + geom_smooth(se = F, method = lm)

How to change the labels of x and y axes?

My_graph + geom_point() + geom_smooth(se = F, method = lm) +
  labs(x = "Exam scores %", y = "Anxiety scores")

Histograms are used to show distributions of variables while bar charts are used to compare variables. Histograms plot quantitative data with ranges of the data grouped into bins or intervals while bar charts plot categorical data.

#ggplot(data = exam_data, aes(x = Anxiety, y = Exam )) + geom_histogram()
# the code above gives an error as geom_histogram can only have x or y axis in its aes()

ggplot(data = exam_data, aes(x = Anxiety)) + geom_histogram()

ggplot(data = exam_data, aes(y = Anxiety)) + geom_histogram()

ggplot(data = exam_data, aes(x = Anxiety)) + geom_histogram(bins = 31)

ggplot(data = exam_data, aes(x = Anxiety)) + geom_histogram(bins = 31, fill = "green")

ggplot(data = exam_data, aes(x = Anxiety)) + geom_histogram(bins = 31, fill = "green", col = "red")

Let’s stop using the My_graph variable and write the whole code from the start again for a bar chart

ggplot(data = exam_data, aes(x = Sleep_quality))+
  geom_bar()

Because we want to plot a summary of the data (the mean) rather than the raw scores themselves, we have to use a stat.

ggplot(data = exam_data, aes(x = Sleep_quality, y = Exam, fill = Gender))+
  geom_bar(stat = "summary", fun = "mean")

ggplot(data = exam_data, aes(x = Sleep_quality, y = Exam, fill = Gender))+
  geom_bar(stat = "summary", fun = "mean", position = "dodge")

The other way to get the same plot that the code above gives, is using the stat_summary function that takes the following general form: stat_summary(function = x, geom = y)

ggplot(data = exam_data, aes(x = Sleep_quality, y = Exam, fill = Gender))+
  stat_summary(fun = mean, geom = "bar", position = "dodge")

How to combine multiple plots? How to combine multiple plots? We can use the patchwork package. A nice tutorial on using this package can be found here

p1 = My_graph + geom_point(aes(color = Gender)) + geom_smooth()

p2 = ggplot(data = exam_data, aes(x = Anxiety)) + geom_histogram(bins = 31)

p3 = ggplot(data = exam_data, aes(x = Sleep_quality, y = Exam, fill = Gender))+
  stat_summary(fun = mean, geom = "bar", position = "dodge")

p4 = My_graph + geom_point() + geom_smooth(se = F, method = lm) +
  labs(x = "Exam scores %", y = "Anxiety scores")

combined = p1 + p2+ p3 + p4 + plot_layout(nrow = 4, byrow = F)

combined

p1 | p2 / p3 / p4

p1 | p2 / (p3 / p4)

ggsave() function, which is a versatile exporting function that can export as PostScript (.eps/.ps), tex (pictex), pdf, jpeg, tiff, png, bmp, svg and wmf (in Windows only). In its basic form, the structure of the function is very simple: ggsave(filename)

ggsave(combined, filename = here("outputs", "combined.png"), dpi=300)

Now that we learned the basics of ggplot, let’s draw some plot for our experiment data. First, we need to create a dataset with aggregated depression_score scores over group and stage. We will use this dataset for line and bar graphs.

library(ggsci)

data_exp1_orig <- read_csv(here("cleaned_data","cleaned_data_exp1.csv"))

data_exp1 <- data_exp1_orig%>% 
  #mutate_if(is.character, factor) %>%
  mutate(subject= factor(subject), # convert all characters to factor
         group = factor(group),
         stage = factor(stage))


aggregated_data_exp1 <- data_exp1 %>%
  group_by(stage, group) %>%
  mutate(depression_score = mean(depression_score)) %>%
  ungroup()


barplot_exp1 <- aggregated_data_exp1 %>%
  ggplot(aes(x=stage, y= depression_score, fill=group)) +
  geom_bar(stat = "identity", position= "dodge")+
  labs (x= '', y= "Depression Score") + 
  theme_bw() + 
  scale_fill_jama() 

#ggsave(barplot_exp1, filename = here("outputs","barplot_exp1.png"), dpi=300)


barplot_facet_exp1 <- aggregated_data_exp1 %>%
  ggplot(aes(x=group, y= depression_score, fill=stage)) +
  geom_bar(stat = "identity", position= "dodge")+
  labs (x= '', y= "Depression Score") + 
  theme_bw() + 
  theme(legend.position = "none",
        axis.text=element_text(size=11),
        axis.title = element_text(size = 12)) +
  facet_wrap(~stage, nrow = 1)+
  scale_fill_jco() 

#ggsave(barplot_facet_exp1, filename = here("outputs","barplot_facet_exp1.png"), dpi=300)


lineplot_exp1 <- aggregated_data_exp1 %>%
  ggplot(aes(x=factor(stage), y= depression_score, group= group, color= group)) +
  geom_line(aes(linetype= group)) +
  geom_point(size= 5)+
  labs (x= '', y= "Depression Score") + 
  theme_classic() +
  theme(legend.position = "bottom",
        axis.text=element_text(size=11),
        axis.title = element_text(size = 12)) +
  scale_color_nejm() 

#ggsave(lineplot_exp1, filename = here("outputs","lineplot_exp1.png"), dpi=300)


violinplot_exp1 <- data_exp1 %>%
  ggplot(aes(x=factor(stage), y= depression_score, fill= group)) +
  geom_violin()+
  labs (x= '', y= "Depression Score") + 
  theme_bw() + 
  theme(legend.position = "bottom",
        axis.text=element_text(size=11)) +
  scale_fill_d3() 

#ggsave(violinplot_exp1, filename = here("outputs","violinplot_exp1.png"), dpi=300)


boxplot_exp1 <- data_exp1 %>%
  ggplot(aes(x=factor(stage), y= depression_score, fill= group)) +
  geom_boxplot()+
  #geom_point(position = position_dodge(width=0.75), alpha= .5)+
  labs (x= '', y= "Depression Score") + 
  theme_bw() + 
  theme(legend.position = "bottom",
        axis.text=element_text(size=11)) +
  scale_fill_simpsons() 

#ggsave(boxplot_exp1, filename = here("outputs","boxplot_exp1.png"), dpi=300)


boxplot_facet_exp1 <- data_exp1 %>%
  ggplot(aes(x=factor(stage), y= depression_score, fill= group)) +
  geom_boxplot()+
  labs (x= '', y= "Depression Score") + 
  theme_bw() + 
  theme(legend.position = "bottom",
        axis.text=element_text(size=11),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  facet_wrap(~group)+
  scale_color_simpsons() 

#ggsave(boxplot_facet_exp1, filename = here("outputs","boxplot_facet_exp1.png"), dpi=300)

Let’s combine our plots:

combined_plot_exp1 <- barplot_facet_exp1 / (lineplot_exp1+violinplot_exp1+boxplot_exp1)
combined_plot_exp1

And here, we save our plots to the outputs folder.

ggsave(combined_plot_exp1, filename = here("outputs","combined_plot_exp1.png"), dpi=300, width = 12)

4 Descriptive Statistics

Now, let’s do some descriptive statistics. Now, we can open a new script called data_analysis.r and read some datasets. Then we use skimr package to describe our data.

narcissism_data <- read_csv(here("cleaned_data","narcissism_data.csv"))
narcissism_data %>% skimr::skim()
Data summary
Name Piped data
Number of rows 131
Number of columns 5
_______________________
Column type frequency:
character 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
subject 0 1 5 7 0 131 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
psychopathy 0 1 8.78 2.27 0 8.0 10 10 11 ▁▁▁▂▇
self_esteem 0 1 8.45 1.68 4 8.0 8 9 12 ▁▅▇▆▃
narcissism 0 1 38.20 6.15 19 33.5 39 43 48 ▁▂▇▇▆
mental_health 0 1 3.19 1.04 1 3.0 4 4 4 ▂▂▁▃▇
  • Exercise: Open the dataset called treatment_data.csv and do a descriptive analysis:
treatment_data <- read_csv(here("cleaned_data","treatment_data.csv"))
treatment_data %>% skimr::skim()
Data summary
Name Piped data
Number of rows 131
Number of columns 7
_______________________
Column type frequency:
character 3
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
subject 0 1 5 7 0 131 0
gender 0 1 4 6 0 2 0
treatment 0 1 3 13 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 21.15 6.52 16 18.0 19 20.0 63 ▇▁▁▁▁
anxiety 0 1 62.35 24.51 0 40.0 69 81.0 100 ▂▆▃▇▆
depression 0 1 52.50 22.12 0 34.5 51 71.0 100 ▂▇▇▆▃
life_satisfaction 0 1 41.02 23.93 0 21.0 39 56.5 100 ▅▇▅▃▂
  • Exercise: Do the same thing for the memory_data.csv.
memory_data <- read_csv(here("cleaned_data","memory_data.csv"))
memory_data %>% group_by(time) %>%
  skimr::skim()
Data summary
Name Piped data
Number of rows 262
Number of columns 5
_______________________
Column type frequency:
character 2
numeric 2
________________________
Group variables time

Variable type: character

skim_variable time n_missing complete_rate min max empty n_unique whitespace
subject post_test_memory 0 1 5 7 0 131 0
subject pre_test_memory 0 1 5 7 0 131 0
gender post_test_memory 0 1 4 6 0 2 0
gender pre_test_memory 0 1 4 6 0 2 0

Variable type: numeric

skim_variable time n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age post_test_memory 0 1 21.15 6.52 16 18.0 19 20.0 63 ▇▁▁▁▁
age pre_test_memory 0 1 21.15 6.52 16 18.0 19 20.0 63 ▇▁▁▁▁
memory_score post_test_memory 0 1 52.50 22.12 0 34.5 51 71.0 100 ▂▇▇▆▃
memory_score pre_test_memory 0 1 41.02 23.93 0 21.0 39 56.5 100 ▅▇▅▃▂
  • Exercise: For this exercise, we use a dataset of one of my own studies. In this study, we asked participants to guess the physical brightness of some reasoning arguments and then we gave a cognitive ability test. (See the original study here). Open ghasemi_brightness_exp4.csv file and answer to the following questions:
  1. How many participants did we test in total?
  2. Find out how many male and female we tested.
  3. Calculate mean and sd for age and cognitive ability (cog_ability).
ghasemi_data <- read_csv(here("cleaned_data","ghasemi_brightness_exp4.csv"))

ghasemi_data %>% summarise(n = n_distinct(participant)) # number of participants:200
n
200
ghasemi_data %>% group_by (participant) %>% filter (row_number()==1) %>% group_by (gender) %>% summarise(n= n()) %>% ungroup() # 183 female, 17 male
gender n
Female 183
Male 17
ghasemi_data %>% dplyr::select (age, cog_ability) %>% skimr::skim() # mean and sd for age and cognitive ability
Data summary
Name Piped data
Number of rows 38400
Number of columns 2
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 22.20 6.78 17 19 20 22 52 ▇▁▁▁▁
cog_ability 0 1 39.55 9.46 11 34 40 46 61 ▁▃▇▆▂

5 Data Analysis

5.1 t-test

Now, we use the treatment data to run three different independent t-tests. Suppose we did an experiment to compare the effectiveness of CBT vs. Psychodynamic therapies in decreasing anxiety, and depression and also in improving life satisfaction:

# t.test (indep)
t.test(anxiety~treatment, data= treatment_data)
## 
##  Welch Two Sample t-test
## 
## data:  anxiety by treatment
## t = -0.85021, df = 124.18, p-value = 0.3968
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -12.11096   4.83264
## sample estimates:
##           mean in group CBT mean in group Psychodynamic 
##                    60.54545                    64.18462
t.test(depression~treatment, data= treatment_data)
## 
##  Welch Two Sample t-test
## 
## data:  depression by treatment
## t = -2.8725, df = 123.97, p-value = 0.004792
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -18.21965  -3.35424
## sample estimates:
##           mean in group CBT mean in group Psychodynamic 
##                    47.15152                    57.93846
t.test(life_satisfaction~treatment, data= treatment_data)
## 
##  Welch Two Sample t-test
## 
## data:  life_satisfaction by treatment
## t = -5.2688, df = 127.11, p-value = 0.0000005699
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -27.61850 -12.53721
## sample estimates:
##           mean in group CBT mean in group Psychodynamic 
##                    31.06061                    51.13846

In another experiment, suppose we have created a method to boost memory. Then, we recruit some participants, do a memory pre-test, implement the method, and do a memory post-test, Now, we want to see whether our method have improved participants’ memory:

# t.test (paired)
t.test(memory_score~time, data= memory_data, paired= T)
## 
##  Paired t-test
## 
## data:  memory_score by time
## t = 5.4761, df = 130, p-value = 0.0000002163
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.333171 15.628661
## sample estimates:
## mean of the differences 
##                11.48092

Now that we learned about t-test, let’s perform this test on our dataset. Is there a difference between groups at the first stage? Ideally, we want participants’ depresion score at the first stage to be similar for both groups because we have not started our treatment yet. Previous graphs showed us that depression scores of the CBT and Psychodynamic groups at this stage are pretty close. Let’s test that using an independent t-test (because we have 2 independent groups):

# Is there a difference between groups at the first stage?
data_exp1 %>% 
  group_by(group) %>% 
  filter(stage=='stage1') %>% 
  ungroup () %>%
  t.test(depression_score~group, data = ., paired=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  depression_score by group
## t = 0.10768, df = 118.92, p-value = 0.9144
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -9.205588 10.264329
## sample estimates:
##           mean in group CBT mean in group Psychodynamic 
##                    53.59091                    53.06154

Now, we wonder if psychotherapy treatments were effective at all, regardless of the treatment method. So, we would like to test if depresion score at the forth stage are lower than scores at the stage 2? Since a pair of score at stage 2 and stage 4 is coming from a same person, we use paired t-test.

# Is there a difference between ratings of stage2 and stage4?
data_exp1 %>% 
  filter(stage=='stage2' | stage=='stage4') %>% 
  ungroup () %>%
  t.test(depression_score~stage, data = ., paired=TRUE)
## 
##  Paired t-test
## 
## data:  depression_score by stage
## t = 5.5931, df = 130, p-value = 0.0000001261
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.70108 16.13098
## sample estimates:
## mean of the differences 
##                11.91603
  • Exercise: John et al. (2019) investigated the consequences of backing down (changing one’s mind in lights of evidence)and how other people view someone who change their mind. In their second experiments, they presented participants either with a person who changes their mind or a person who refuses to back down. Then, they asked participants to rate how intelligent and confident the person is (See the original study here). They reported that:

“Relative to the entrepreneur who did not back down, participants judged the entrepreneur who backed down as more intelligent (M_backed_down=5.13 out of 7, SD=1.09; M_did_not_back_down=3.97, SD=1.54; t(271.12)=−7.59, p < .001) but less confident (M_backed_down=4.50 out of 7, SD=1.36; M_did_not_back_down=5.65, SD=1.10; t(291.01)=8.08, p < .001).”.

Open the john_backdown_exp2.csv file and try to reproduce their results. Run two separate independent t-test, one with intelligent as the dependent variable and one with confident as the dependent variable. For both t-test, use back_down as the between-subject independent variable.

john_data <- read_csv(here("cleaned_data","john_backdown_exp2.csv"))


t.test(intelligent~back_down, data = john_data, paired=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  intelligent by back_down
## t = 7.5853, df = 271.12, p-value = 0.0000000000005319
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.8577107 1.4590076
## sample estimates:
##       mean in group backed_down mean in group did_not_back_down 
##                        5.129412                        3.971053
t.test(confident~back_down, data = john_data, paired=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  confident by back_down
## t = -8.0763, df = 291.01, p-value = 0.00000000000001787
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.4257768 -0.8670294
## sample estimates:
##       mean in group backed_down mean in group did_not_back_down 
##                        4.503268                        5.649671

5.2 Analysis of Variance (ANOVA)

Now, let’s analysis our main experiment data: Do participants in the CBT group show better outcome compared to participants in the Psychodynamic group? Suppose we believe that participants should show lower depression after 5 or 10 sessions of both psychotherapy treatments and this decrease should be more pronounced for CBT than psychodynamic psychotherapy. If this is the case. we expect an interaction in the traditional Analysis of Variance (AONVA) test.

aov_m1 <- aov_car (depression_score ~ group*stage +
                     Error(subject/stage), data = data_exp1)  
Effect df MSE F ges p.value
group 1, 129 737.60 27.08 *** .066 <.001
stage 2.97, 382.72 492.81 53.15 *** .215 <.001
group:stage 2.97, 382.72 492.81 8.91 *** .044 <.001

As you can see, we found a significant main effect of stage and a significant group by stage interaction. We can use the emmeans package to do post-hoc tests.

# main effect of stage
emmeans(aov_m1, 'stage')
##  stage  emmean   SE  df lower.CL upper.CL
##  stage1   53.3 1.83 579     49.7     56.9
##  stage2   33.3 1.83 579     29.7     36.9
##  stage3   26.3 1.83 579     22.7     29.9
##  stage4   21.4 1.83 579     17.8     25.0
##  stage5   31.4 1.83 579     27.8     35.0
## 
## Results are averaged over the levels of: group 
## Warning: EMMs are biased unless design is perfectly balanced 
## Confidence level used: 0.95
pairs(emmeans(aov_m1, 'stage'), adjust= 'holm')
##  contrast        estimate   SE  df t.ratio p.value
##  stage1 - stage2    20.03 2.36 516  8.480  <.0001 
##  stage1 - stage3    26.94 2.36 516 11.404  <.0001 
##  stage1 - stage4    31.91 2.36 516 13.506  <.0001 
##  stage1 - stage5    21.84 2.36 516  9.245  <.0001 
##  stage2 - stage3     6.91 2.36 516  2.924  0.0144 
##  stage2 - stage4    11.87 2.36 516  5.027  <.0001 
##  stage2 - stage5     1.81 2.36 516  0.766  0.4442 
##  stage3 - stage4     4.97 2.36 516  2.102  0.0941 
##  stage3 - stage5    -5.10 2.36 516 -2.158  0.0941 
##  stage4 - stage5   -10.07 2.36 516 -4.261  0.0001 
## 
## Results are averaged over the levels of: group 
## P value adjustment: holm method for 10 tests
# group by stage interaction
emmeans(aov_m1, "group", by= "stage")
## stage = stage1:
##  group         emmean   SE  df lower.CL upper.CL
##  CBT             53.5 2.59 577    48.47     58.6
##  Psychodynamic   53.0 2.60 580    47.92     58.1
## 
## stage = stage2:
##  group         emmean   SE  df lower.CL upper.CL
##  CBT             30.7 2.59 577    25.58     35.7
##  Psychodynamic   35.9 2.60 580    30.75     41.0
## 
## stage = stage3:
##  group         emmean   SE  df lower.CL upper.CL
##  CBT             21.7 2.59 577    16.62     26.8
##  Psychodynamic   31.0 2.60 580    25.89     36.1
## 
## stage = stage4:
##  group         emmean   SE  df lower.CL upper.CL
##  CBT             13.4 2.59 577     8.29     18.4
##  Psychodynamic   29.4 2.60 580    24.29     34.5
## 
## stage = stage5:
##  group         emmean   SE  df lower.CL upper.CL
##  CBT             18.8 2.59 577    13.74     23.9
##  Psychodynamic   44.1 2.60 580    38.96     49.2
## 
## Warning: EMMs are biased unless design is perfectly balanced 
## Confidence level used: 0.95
update(pairs(emmeans(aov_m1, "group", by= "stage")), by = NULL, adjust = "holm") 
##  contrast            stage  estimate   SE  df t.ratio p.value
##  CBT - Psychodynamic stage1    0.529 3.67 579  0.144  0.8852 
##  CBT - Psychodynamic stage2   -5.195 3.67 579 -1.417  0.3138 
##  CBT - Psychodynamic stage3   -9.288 3.67 579 -2.534  0.0346 
##  CBT - Psychodynamic stage4  -16.022 3.67 579 -4.371  0.0001 
##  CBT - Psychodynamic stage5  -25.244 3.67 579 -6.887  <.0001 
## 
## P value adjustment: holm method for 5 tests

You can use the afex_plot function from afex to create beautiful plots. Those plots interacts nicely with ggplot:

afex_plot(aov_m1, x = "stage", trace = "group", error='between',
          line_arg = list(size=1),
          point_arg = list(size=3.5),
          data_arg = list(size= 1, color= 'grey', width=.4),
          data_geom = geom_boxplot,
          mapping = c("linetype", "shape", "fill"),
          legend_title = "Group") +
  labs(y = "Depression Score", x = "") +
  theme_bw()+ # remove the grey background and grid
  theme(axis.text=element_text(size=13),
        axis.title = element_text(size = 13),
        legend.text=element_text(size=13),
        legend.title=element_text(size=13),
        legend.position='bottom',
        legend.key.size = unit(1, "cm"),
        legend.background = element_rect(colour = 'black', fill = 'white', linetype='solid'))+
  scale_color_simpsons() +
  scale_fill_simpsons()

If you are interested in this topic, check out this nice tutorial about using afex to run ANOVA, and also this interesting tutorial on the emmeans package.

  • Exercise: Rotello et al. (2018) investigated the association between the race (White vs. Black faces) and the gun-tool judgments. In their first experiments, they presented participants with 16 White male faces and 16 Black male faces, and following that 8 images of guns and 8 images of tools. They asked participants to judge if the object is a tool or a gun by pressing keyboard buttons. Then, they ran an ANOVA to see if participants’ gun responses are higher for any of the races. So, they included prime race (Black, White) and target identity (gun, tool) as independent variables and participants’ gun responses as dependent variable into their linear model (See the original study here). They found that:

“Participants made more gun responses to guns than to tools, F(1,45) = 53243, p < 0.0001, η2g = 0.998. However, the race of the prime face did not matter, F(1,45) = 0.287, p > 0.59, η2g = 0.001, nor was there an interaction of prime race with target object, F(1,45) = 0.022, p > 0.88, η2g = 0.000)”.

Open the rotello_shooter_exp1.csv file and try to reproduce their results. Run an ANOVA (type III) with resp as the dependent variable and target, prime, and their interaction as independent variables.

# load the general data file
rotello_data <- read_csv(here("cleaned_data","rotello_shooter_exp1.csv"))

# ANOVA
rotello_aov <- aov_car (resp ~ target*prime +
           Error(subject/target*prime), data = rotello_data)
Effect df MSE F ges p.value
target 1, 45 0.00 53242.99 *** .998 <.001
prime 1, 45 0.00 0.29 .001 .595
target:prime 1, 45 0.00 0.02 <.001 .883

5.3 Correlation

Here, we want to check the correlation between variables on the narcissism_data. First, we need to remove subject column because it is not numeric:

narcissism_data_cor <- narcissism_data %>%
  select(-subject)
#-- Base R:
cor(narcissism_data_cor, method = "pearson",  use = "complete.obs")

#-- Psych library:
psych::pairs.panels(narcissism_data_cor, method = "pearson", hist.col = "#00AFBB", density = T, ellipses = F, stars = T)

#-- Correlation library:
# install.packages("devtools")
# devtools::install_github("easystats/correlation")
#library("correlation")
correlation::correlation(narcissism_data_cor) %>% summary()

#-- apaTables library:
narcissism_data_cor %>% 
  apaTables::apa.cor.table(filename="./outputs/CorMatrix.doc", show.conf.interval=T)
psychopathy self_esteem narcissism mental_health
psychopathy 1.00 0.15 0.40 -0.44
self_esteem 0.15 1.00 0.11 -0.29
narcissism 0.40 0.11 1.00 -0.26
mental_health -0.44 -0.29 -0.26 1.00
Parameter mental_health narcissism self_esteem
psychopathy -0.44 0.40 0.15
self_esteem -0.29 0.11
narcissism -0.26
  • Exercise: Pennycook et al. (2020) investigated the relationship between actively open-minded thinking style about evidence (AOT-E) and different political, scientific, and religious beliefs (see the original paper here). In their first experiment, they calculated the correlation of AOTE and scientific beliefs items (global warming, evolution, etc.) and they found the following results:

Open the pennycook_aote_exp1.csv file and try to reproduce their results by creating the same correlation matrix.

pennycook_data <- read_csv(here("cleaned_data","pennycook_aote_exp1.csv")) 


#---------- Base R:
cor(pennycook_data, method = "pearson",  use = "complete.obs")

#---------- Psych library:
pennycook_data %>% 
  psych::pairs.panels(method = "pearson", hist.col = "#00AFBB", density = T, ellipses = F, stars = T)

#---------- Correlation library:
correlation::correlation(pennycook_data) %>% summary()

#---------- apaTables library:
pennycook_data %>% 
  apaTables::apa.cor.table(filename="./outputs/CorMatrix.doc", show.conf.interval=T)
Parameter trust_scien gm_health tech_problems modern_medicine old_earth vaccines stem_cell big_bang evolution global_warming
aote 0.35 0.36 0.44 0.33 0.40 0.47 0.45 0.51 0.51 0.37
global_warming 0.42 0.06 0.14 0.18 0.33 0.26 0.31 0.33 0.38
evolution 0.48 0.33 0.28 0.36 0.47 0.39 0.54 0.78
big_bang 0.49 0.37 0.28 0.36 0.45 0.37 0.54
stem_cell 0.47 0.34 0.36 0.47 0.40 0.40
vaccines 0.43 0.52 0.49 0.53 0.38
old_earth 0.29 0.24 0.21 0.33
modern_medicine 0.43 0.42 0.47
tech_problems 0.33 0.39
gm_health 0.31

5.4 Linear Regression

Here, we do single and multiple linear regreassion on the narcissism_data:

m1 <- lm(mental_health~narcissism, data= narcissism_data)
term estimate std.error statistic p.value
(Intercept) 4.86 0.56 8.75 0
narcissism -0.04 0.01 -3.04 0
m2 <- lm(mental_health~narcissism+psychopathy, data= narcissism_data)
term estimate std.error statistic p.value
(Intercept) 5.43 0.53 10.27 0.00
narcissism -0.02 0.01 -1.09 0.28
psychopathy -0.19 0.04 -4.71 0.00
  • Exercise: Trémolière and Djeriouat (2020) examined the role of cognitive reflection and belief in science in climate change skepticism. In their first study, they revealed that cognitive reflection and belief in science negetively predicted climate change skepticism even after controlling for demographic and cognitive ability variables (see the original paper here).

Open the tremoliere_data_exp1.csv file and try to reproduce their results by running a multiple linear regression. Enter age, gender, education, belief in science, literacy, numeracy (Numtotal), and cognitive reflection as predictors and enter climate change skepticism (climato) as the outcome variable.

Tremoliere_data <- read_csv(here("cleaned_data","tremoliere_data_exp1.csv"))

Tremoliere_reg=lm(Climato ~ Age+ Gender+ Education+ BeliefInSciencetotal+ Literacy+ Numtotal+ CognitiveReflection,
                    data=Tremoliere_data)
term estimate std.error statistic p.value
(Intercept) 57.57 5.19 11.09 0.00
Age 0.01 0.05 0.24 0.81
Gender -5.68 1.34 -4.23 0.00
Education 0.54 0.38 1.43 0.15
BeliefInSciencetotal -0.20 0.06 -3.62 0.00
Literacy -0.49 0.51 -0.96 0.34
Numtotal -1.52 0.83 -1.82 0.07
CognitiveReflection -18.58 4.26 -4.37 0.00
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.19 0.17 12.65 11.91 0 7 -1467.77 2953.54 2988.81 58235.89 364 372

6 Rmarkdown

To be completed…

7 References

  • Ghasemi, O., Handley, S., & Howarth, S. (2020). The Bright Homunculus in our Head: Individual Differences in Intuitive Sensitivity to Logical Validity.

  • John, L. K., Jeong, M., Gino, F., & Huang, L. (2019). The self-presentational consequences of upholding one’s stance in spite of the evidence. Organizational Behavior and Human Decision Processes, 154, 1-14.

  • Pennycook, G., Cheyne, J. A., Koehler, D. J., & Fugelsang, J. A. (2020). On the belief that beliefs should change according to evidence: Implications for conspiratorial, moral, paranormal, political, religious, and science beliefs. Judgment and Decision Making, 15(4), 476.

  • Rotello, C. M., Kelly, L. J., Heit, E., Vazire, S., & Vul, E. (2018). The Shape of ROC Curves in Shooter Tasks: Implications for Best Practices in Analysis. Collabra: Psychology, 4(1).

  • Trémolière, B., & Djeriouat, H. (2020). Don’t you see that its cold! Exploring the roles of cognitive reflection, climate science literacy, illusion of knowledge, and political orientation in climate change skepticism.

  • Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1-23.

---
title: "R for Data Analysis"
author:
  - name: "Omid Ghasemi"
    affiliation: Macquarie University
    email: omidreza.ghasemi@hdr.mq.edu.au
  - name: "Mahdi Mazidi"
    affiliation: University of Western Australia
    email: mahdi.mazidisharafabadi@research.uwa.edu.au
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: 
  html_document:
    keep_md: yes
    number_sections: true
    theme: cerulean
    code_download: true
    #code_folding: hide
    toc: true
    toc_float: true
    df_print: "kable"
---

This document is the summary of the **R for Data Analysis** workshop. 

All correspondence related to this document should be addressed to: 

<center>
Omid Ghasemi (Macquarie University, Sydney, NSW, 2109, AUSTRALIA) 

Email: omidreza.ghasemi@hdr.mq.edu.au 
</center>



<style>

body{ /* Normal  */
      text-align: justify;
      font-family: "Times New Roman", Times, serif;
}
code.r{ /* Code block */
    font-size: 14px;
}
pre { /* Code block - determines code spacing between lines */
    font-size: 12px;
}

</style>


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(fig.align="center")
```



```{r libraries, message=FALSE, echo=F}
# load libraries
library(tidyverse)
library(here)
library(janitor)
library(broom)
library(afex)
library(emmeans)
library(knitr)
library(kableExtra)
library(ggsci)
library(patchwork)
library(skimr)
# install.packages("devtools")
# devtools::install_github("easystats/correlation")
library("correlation")
options(scipen=999) # turn off scientific notations
options(contrasts = c('contr.sum','contr.poly')) # set the contrast sum globally 
options(knitr.kable.NA = '')
```


# Introduction to R

## Basics and Variables

```{r echo=FALSE, out.width="700px", out.height="700px", fig.cap= "Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations"}
knitr::include_graphics(here('inputs','r_first_then.png'))
```


R can be used as a calculator. For mathematical purposes, be careful of the order in which R executes the commands.

```{r}
10 + 10

4 ^ 2

(250 / 500) * 100
```

R is a bit flexible with spacing (but no spacing in the name of variables and words)

```{r}
10+10

10                 +           10
```

R can sometimes tell that you're not finished yet

```{r eval=F}
10 +
```

How to create a *variable*? Variable assignment using `<-` and `=`. Note that R is case sensitive for everything

```{r}
pay <- 250

month = 12

pay * month

salary <- pay * month
```


Few points in naming variables and vectors: use short, informative words, keep same method (e.g., you can use capital letters but it is not recommended, use only _ or . ).

## Function 
Function is a set of statements combined together to perform a specific task. When we use a block of code repeatedly, we can convert it to a function. To write a function, first, you need to *define* it:

```{r}
my_multiplier <- function(a,b){
  result = a * b
  return (result)
}
```

This code do nothing. To get a result, you need to *call* it:

```{r}
my_multiplier (a=2, b=4)
# or: my_multiplier (2, 4)
```

We can set a default value for our arguments:

```{r}
my_multiplier2 <- function(a,b=4){
  result = a * b
  return (result)
}

my_multiplier2 (a=2)
# or: my_multiplier (2)
# or: my_multiplier (2, 6)
```

Fortunately, you do not need to write everything from scratch. R has lots of built-in functions that you can use:
```{r}
round(54.6787)
round(54.5787, digits = 2)
```

Use `?` before the function name to get some help. For example, `?round`. You will see many functions in the rest of the workshop.

## Data Types

function `class()` is used to show what is the type of a variable.


1. *Logical*: `TRUE`, `FALSE` can be abbreviated as `T`, `F`.  They has to be capital, 'true' is not a logical data:
```{r}
class(TRUE)
class(F)
```

2. *Numeric*: all numbers e.g. 5,  10.5,  11,37;  a special type of numeric is "integer" which is numbers without decimal. Integers are always numeric, but numeric is not always integer:
```{r}
class(2)
class(13.46)
```

3. *Character*: text for example, "I love R" or "4" or "4.5":
```{r}
class("ha ha ha ha")
class("56.6")
class("TRUE")
```

Can we change the type of data in a variable? Yes, you need to use the function `as.---()`

```{r}
as.numeric(TRUE)
as.character(4)
as.numeric("4.5")
as.numeric("Hello")
```


## Data Structures


### Vector 

When there are more than one number or letter stored. Use the combine function c() for that.

```{r}
sale <- c(1, 2, 3,4, 5, 6, 7, 8, 9, 10) # also sale <- c(1:10)

sale <- c(1:10)

sale * sale
```

*Subsetting a vector*:

```{r}
days <- c("Saturday", "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

days[2]
days[-2]

days[c(2, 3, 4)]
```


* *Exercise*: Create a vector named `my_vector` with numbers from 0 to 1000 in it and calculate mean, median, sd, min, max, and sum of that vector:

```{r}
my_vector <- (0:1000)

mean(my_vector)
median(my_vector)
min(my_vector)
range(my_vector)
class(my_vector)
sum(my_vector)
sd(my_vector)
```

### List

List allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, even other list.

```{r}
my_list = list(sale, 1, 3, 4:7, "HELLO", "hello", FALSE)
my_list
```

### Factor
Factors store the vector along with the distinct values of the elements in the vector as labels. The labels are always character irrespective of whether it is numeric or character. For example, variable gender with "male" and "female" entries:

```{r}
gender <- c("male", "male", "male", " female", "female", "female")
gender <- factor(gender)
```

R now treats gender as a nominal (categorical) variable: 1=female, 2=male internally (alphabetically).
```{r}
summary(gender)
```

* *Question*: why when we ran the above function i.e. summary(), it showed three and not two levels of the data? *Hint*: run 'gender'.

```{r}
gender
```

So, be careful of spaces!

* *Exercise*: Create a gender factor with 30 male and 40 females (*Hint*: use the `rep()` function):
```{r}
gender <- c(rep("male",30), rep("female", 40))
gender <- factor(gender)
gender
```

There are two types of categorical variables: nominal and ordinal. How to create ordered factors (when the variable is nominal and values can be ordered)? We should add two additional arguments to the `factor()` function: `ordered = TRUE`, and `levels = c("level1", "level2")`. For example, we have a vector that shows participants' education level.

```{r}
edu<-c(3,2,3,4,1,2,2,3,4)

education<-factor(edu, ordered = TRUE)
levels(education) <- c("Primary school","high school","College","Uni graduated")
education
```

* *Exercise*: We have a factor with `patient` and `control` values. Here, the first level is control and the second level is patient. Change the order of levels, so patient would be the first level:

```{r}
health_status <- factor(c(rep('patient',5),rep('control',5)))
health_status

health_status_reordered <- factor(health_status, levels = c('patient','control'))
health_status_reordered
```

Finally, can you relabel both levels to uppercase characters? (*Hint*: check `?factor`)

```{r}
health_status_relabeled <- factor(health_status, levels = c('patient','control'), labels = c('Patient','Control'))
health_status_relabeled
```


### Matrices
All columns in a matrix must have the same mode(numeric, character, etc.) and the same length. It can be created using a vector input to the matrix function.

```{r}
my_matrix = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, ncol = 3)

my_matrix
```

### Data frames 

Data frames can hold numeric, character or logical values. Within a column all elements have the same data type, but different columns can be of different data type. Let's create a dataframe:

```{r}
id <- 1:200
group <- c(rep("Psychotherapy", 100), rep("Medication", 100))
response <- c(rnorm(100, mean = 30, sd = 5),
             rnorm(100, mean = 25, sd = 5))

my_dataframe <-data.frame(Patient = id,
                          Treatment = group,
                          Response = response)
```

We also could have done the below

```{r}
my_dataframe <-data.frame(Patient = c(1:200),
                          Treatment = c(rep("Psychotherapy", 100), rep("Medication", 100)),
                          Response = c(rnorm(100, mean = 30, sd = 5),
                                       rnorm(100, mean = 25, sd = 5)))
```

In large data sets, the function head() enables you to show the first observations of a data frames. Similarly, the function tail() prints out the last observations in your data set.

```{r eval=F}
head(my_dataframe) 
tail(my_dataframe)
```

```{r echo=F}
head(my_dataframe) %>%
  mutate(` `= c(1:6)) %>%
  select(` `, Patient, Treatment,	Response) %>%
  knitr::kable() %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = T)

tail(my_dataframe)%>%
  knitr::kable() %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = T)
```

Similar to vectors and matrices, brackets [] are used to selects data from rows and columns in data.frames:

```{r}
my_dataframe[35, 3]
```

* *Exercise*: How can we get all columns, but only for the first 10 participants?

```{r eval=F}
my_dataframe[1:10, ]
```

```{r echo=F}
knitr::kable(my_dataframe[1:10, ]) %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = T)

```
How to get only the Response column for all participants?

```{r}
my_dataframe[ , 3]
```

Another easier way for selecting particular items is using their names that is more helpful than number of the rows in large data sets:
```{r eval=F}
my_dataframe[ , "Response"]
# OR:
my_dataframe$Response

```

So far, we created dataframes using `data.frame` function from the base R. However, a better way to create dataframes is to use the `tibble` function from tidyverse (see [here](https://r4ds.had.co.nz/tibbles.html)).

# Data Cleaning

```{r echo=FALSE, out.width="700px", out.height="350px", fig.cap= "Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations"}
knitr::include_graphics(here('inputs','environmental-data-science-r4ds-general.png'))
```


```{r echo=FALSE, out.width="700px", out.height="350px", fig.cap= "Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations"}
knitr::include_graphics(here('inputs','cracked_setwd.png'))
```

Now, suppose we ran an experiment with 141 depressed patients. Participants were randomly assigned into two treatment groups: CBT or Psychodynamic psychotherapy. We measured self-report depression scores at 5 different stages of treatment: 

- Stage 1: Before starting any treatment. It is our base stage (pre-test)
- Stage 2: After 5 sessions of psychotherapy (post-test1)
- Stage 3: After 10 sessions of psychotherapy (post-test2)
- Stage 4: At the end of the treatment (post-test3)
- Stage 5: Three months after the treatment (post-test4)

let's read and check the uncleaned data. But, first thing first. let's install and then load the tidyvese package. We also need some other packages:

```{r message=F, warning=F, eval=F}

# Install it
install.packages("tidyverse")

# And then load it
library(tidyverse)

# Load other packages that you have already installed
library(here)
library(janitor)
library(broom)
library(afex)
library(emmeans)
library(knitr)
library(kableExtra)
library(ggsci)
library(patchwork)
library(skimr)
# install.packages("devtools")
# devtools::install_github("easystats/correlation")
library("correlation")
options(scipen=999) # turn off scientific notations
options(contrasts = c('contr.sum','contr.poly')) # set the contrast sum globally 
options(knitr.kable.NA = '')
```

```{r message=F, warning=F, eval=F}
# read the raw data
raw_data <- read_csv(here("raw_data","raw_data_exp1.csv"))
head(raw_data)
```

```{r message=F, warning=F, echo=F}
# read the raw data
raw_data <- read_csv(here("raw_data","raw_data_exp1.csv"))

knitr::kable(head(raw_data)) %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = F)%>%
  scroll_box(width = "780px")
```

* *Exercise*: There is a dataset in the `cleaned_data` folder named `unicef_u5mr.csv`. Read the dataset using `read_csv` and `here`.
```{r message=F, warning=F}
unicef_data <- read_csv(here("cleaned_data","unicef_u5mr.csv"))
```

```{r echo=FALSE, out.width="700px", out.height="350px", fig.cap= "Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations"}
knitr::include_graphics(here('inputs','tidydata_3.jpg'))
```

In order to clean the data, we use *tidyverse* which is a collection of packages to work with data. One of the tidyverse packages that we use regularly is `dplyr` which includes several functions:

- `mutate()` adds new variables or change existing ones.
- `select()` pick variables (columns) based on their names.
- `filter()` picks cases (rows) based on their values.
- `summarise()` gives a single single summary of the data (e.g., mean, counts, etc.)
- `arrange()` changes the ordering of the rows.
- `group_by()` divides your dataframe into grouped dataframes and allow you to do each of the above operations (except for `arrange`) on every one of them separately.

## Select

Pick `subject`, `age`, and `gender` columns:

```{r message=F, warning=F, eval=F}
selected_data <- select(raw_data, subject, age, gender)
```

## Filter
Now, do the following tasks: pick all the male participants, pick all the male participants **or** those greater than 25 years old, and finally pick all male participants **and** those greater than 25 years old:

```{r message=F, warning=F, eval=F}
# filter all males
fil_male <- filter(raw_data, gender == "Male")
# filter males and older than 25
fil_male_and_g25 <- filter(raw_data, gender == "Male" & age > 25 )
# filter males or older than 25
fil_male_or_g25 <- filter(raw_data, gender == "Male" | age > 25 )
```

## Arrange 
Arrange (order) your dataframe based on the age, once in an ascending order (youngers first) and once based on descending order (olders first):

```{r message=F, warning=F}
# order participants based on their age
arranged_data <- arrange(raw_data, age)
# order participants based on their age (descendeing)
arranged_descending <- arrange(raw_data, desc(age))
```

## Mutate
Create a column to show if the participant has finished the task or not:
```{r message=F, warning=F}
mutated_data <- mutate (raw_data, finished= case_when(progress==100~ "Yes",T~ "No"))
```

## Summarise
Summarize participants age and sd:
```{r message=F, warning=F}
summarise(raw_data, mean= mean(age, na.rm=T),
          sd= sd (age, na.rm=T))
```

## Pipe Operators
A new function: **pipe operators** `%>%` pipes a value into the next function:

```{r message=F, warning=F}
raw_data %>% 
  summarise(., mean= mean(age, na.rm=T),
            sd= sd (age, na.rm=T))
```


```{r message=F, warning=F}
raw_data %>% 
  summarise(mean= mean(age, na.rm=T),
            sd= sd (age, na.rm=T))
```

Calculate the age mean of younger than 25 participants only:

```{r message=F, warning=F}
raw_data %>% 
  filter (age < 25) %>%
  summarise(mean= mean(age, na.rm=T),
            sd= sd (age, na.rm=T))
```

## Group By

Calculate the age mean of younger than 25 participants  for each gender separately:

```{r message=F, warning=F}
raw_data %>% 
  filter (age < 25) %>%
  group_by(gender) %>%
  summarise(mean= mean(age, na.rm=T),
            sd= sd (age, na.rm=T)) %>%
  ungroup ()
```         


* *Exercise*: Create a column to show if participant is older than 23 or not and then calculate sleep quality (`sleep_quality`) mean for each group separately:
```{r message=F, warning=F}
raw_data %>%
  mutate(age_group = case_when(age > 23 ~ "greater than 23", T~ "younger than 23")) %>%
  group_by(age_group) %>%
  summarise(sleep_quality = mean(sleep_quality, na.rm=T))
```     

* *Exercise*: Add the anxiety total score (sum) to the dataframe and then convert subject column to factor:
```{r message=F, warning=F}
anxiety_data <- raw_data %>%
  mutate(anxiety_total= anxiety1+anxiety2+anxiety3+anxiety4+anxiety5+anxiety6+anxiety7+anxiety8) %>%
  mutate(subject= factor(subject))
``` 

## Pivoting

Next, we want to pivot our data to switch between long and wide format:

```{r echo=FALSE, out.width="700px", out.height="350px", fig.cap= "Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations"}
knitr::include_graphics(here('inputs','tidydata_1.jpg'))
```

```{r message=F, warning=F}

# Make you data long
long_data <- raw_data %>%
  select(subject, stage1_cbt:stage5_cbt,stage1_dynamic:stage5_dynamic) %>%
  pivot_longer(cols = c(stage1_cbt:stage5_dynamic), names_to = 'stage', values_to = 'depression_score')

# Make you data wide
wide_data <- long_data %>%
  pivot_wider(names_from = stage, values_from= depression_score)

```

* *Exercise*: Convert the UNICEF dataset to long and wide formats:
```{r message=F, warning=F}
unicef_data <- read_csv(here("cleaned_data","unicef_u5mr.csv"))

library(janitor)
unicef_data_cleaned <- unicef_data %>%
  clean_names()

unicef_long_data <- unicef_data_cleaned %>% pivot_longer(cols = c(u5mr_1950:u5mr_2015), names_to = 'year', values_to = 'u5mr')
unicef_wideg_data <- unicef_long_data %>% pivot_wider(names_from = 'year', values_from = 'u5mr')
```

*Note*: The codes for the previous exercise were taken from [this blog post](https://sejdemyr.github.io/r-tutorials/basics/wide-and-long/) written by Simon Ejdemyr.

Now, let's do some cleaning using `dplyr`, `tidyr` and other `tidyverse` libraries: 
```{r message=F, warning=F, eval=F}
cleaned_data <- raw_data %>% 
  filter(progress == 100) %>% # filter out unfinished participants
  select(-consent_form) %>% #remove some useless columns
  # create a total score for our questionnaire
  mutate(anxiety_total= anxiety1+anxiety2+anxiety3+anxiety4+anxiety5+anxiety6+anxiety7+anxiety8) %>%
  select(-anxiety1:-anxiety8) %>%
  # make our dataframe long
  pivot_longer(cols = c(stage1_cbt:stage5_cbt,stage1_dynamic:stage5_dynamic),names_to = 'stage',values_to = 'depression_score') %>% 
  #pivot_wider(names_from = stage, values_from= depression_score) # this code change our dataframe back to wide
  filter(!is.na(depression_score)) %>% #remove rows with depression_score == NA
  mutate(stage= gsub("_.*", "", stage)) %>%
  select (subject, age, gender, group, stage, depression_score, anxiety_total, sleep_quality, life_satisfaction)
```


```{r message=F, warning=F, echo=F}
cleaned_data <- raw_data %>% 
  filter(progress == 100) %>% # filter out unfinished participants
  select(-consent_form) %>% #remove some useless columns
  # create a total score for our questionnaire
  mutate(anxiety_total= anxiety1+anxiety2+anxiety3+anxiety4+anxiety5+anxiety6+anxiety7+anxiety8) %>%
  select(-anxiety1:-anxiety8) %>%
  # make our dataframe long
  pivot_longer(cols = c(stage1_cbt:stage5_cbt,stage1_dynamic:stage5_dynamic),names_to = 'stage',values_to = 'depression_score') %>% 
  #pivot_wider(names_from = stage, values_from= depression_score) # this code change our dataframe back to wide
  filter(!is.na(depression_score)) %>% #remove rows with depression_score == NA
  mutate(stage= gsub("_.*", "", stage)) %>%
  select (subject, age, gender, group, stage, depression_score, anxiety_total, sleep_quality, life_satisfaction)

knitr::kable(head(cleaned_data)) %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = F)%>%
  scroll_box(width = "780px")
```

Ok, now the data is clean and tidy which means:

> 1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table ([Wickham](https://vita.had.co.nz/papers/tidy-data.pdf), 2014).


Check the dataframe and all the data types:
```{r}
str(cleaned_data)
```

Finally, we save our data to the `cleaned_data` folder.

```{r}
write_csv(cleaned_data, here("cleaned_data","cleaned_data_exp1.csv"))
```


# Data Visualization

```{r echo=FALSE, out.width="700px", out.height="350px", fig.cap= "Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations"}
knitr::include_graphics(here('inputs','ggplot2_masterpiece.png'))
```

Before starting the ggplot, let's try a visualization using a function from the Base R the plot() function shows the association of each variable against the other one in a data handy for data with few number of variables to see if there are any patterns

```{r message=F, warning=F, dpi= 300, fig.height=3, fig.width=4}

exam_data<- read_csv(here::here("cleaned_data", "exam_data.csv"))

plot(x = exam_data$Anxiety, y = exam_data$Exam)

```

The code also works without writing x and y, however, writing them is strongly recommended

```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}

plot(exam_data$Anxiety, exam_data$Exam)
```

`ggplot`, the gg in ggplot stands for grammar of graphics. Grammar of graphics basically says any graphical representation of data, can be produced by a series of layers. You can think of a layer as a plastic transparency. Lets draw the same plot using ggplot. Always, mention the data we are going to work with.
```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
ggplot(data = exam_data, aes(x = Exam, y = Anxiety))
```


- `aes`: aes which stands for aesthetics is a relationship between a variable in your dataset and an aspect of the plot that is going to visually convey the information to the reader

- Visual elements are known as geoms (short for 'geometric objects') in ggplot 2. When we define a layer, we have to tell R what geom we want displayed on that layer (do we want a bar, line dot, etc.?)

```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
ggplot(data = exam_data, aes(x = Exam, y = Anxiety))+ geom_point()
```

So, lets try some of them here like shape and size. Be careful with the + sign, if you clink enter for the next part of the code, the + sign should not go to the next line

```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
ggplot(data = exam_data, aes(x = Exam, y = Anxiety))+
  geom_point(size = 2, shape = 8)
```

The current plot is not very informative about the patterns for each gender.
```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
ggplot(data = exam_data, aes(x = Exam, y = Anxiety, color = Gender))+
  geom_point(size = 2, shape = 10)

ggplot(data = exam_data, aes(x = Exam, y = Anxiety, color = Gender, shape = Gender))+
  geom_point(size = 2, shape = 10)
```

Question: why the above code doesn't make any change?

```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
ggplot(data = exam_data, aes(x = Exam, y = Anxiety, color = Gender, shape = Gender))+
  geom_point(size = 2)
```

Can assign the first layer to a variable to reduce the length of codes for next layers.

```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
My_graph <- ggplot(data = exam_data, aes(x = Exam, y = Anxiety))

My_graph + geom_point()
```

lets add a line to the current graph
```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
My_graph + geom_point() + geom_smooth()
```

Aesthetics can be set for all layers of the plot (i.e., defined in the plot as a whole) or can be set individually for each geom in a plot.

```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
My_graph + geom_point(aes(color = Gender)) + geom_smooth()

My_graph + geom_point(aes(color = Gender)) + geom_smooth(aes(color = Gender))
```

The shaded area around the line is the 95% confidence interval around the line. We can switch this off by  adding `se = F` (which is short for 'standard error = False')

```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
My_graph + geom_point() + geom_smooth(se = F)
```


What if we want our line to be a direct line?
```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
My_graph + geom_point() + geom_smooth(se = F, method = lm)

```
How to change the labels of x and y axes?
```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
My_graph + geom_point() + geom_smooth(se = F, method = lm) +
  labs(x = "Exam scores %", y = "Anxiety scores")
```

Histograms are used to show distributions of variables while bar charts are used to compare variables. Histograms plot quantitative data with ranges of the data grouped into bins or intervals while bar charts plot categorical data.

```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
#ggplot(data = exam_data, aes(x = Anxiety, y = Exam )) + geom_histogram()
# the code above gives an error as geom_histogram can only have x or y axis in its aes()

ggplot(data = exam_data, aes(x = Anxiety)) + geom_histogram()

ggplot(data = exam_data, aes(y = Anxiety)) + geom_histogram()

ggplot(data = exam_data, aes(x = Anxiety)) + geom_histogram(bins = 31)

ggplot(data = exam_data, aes(x = Anxiety)) + geom_histogram(bins = 31, fill = "green")

ggplot(data = exam_data, aes(x = Anxiety)) + geom_histogram(bins = 31, fill = "green", col = "red")
```

Let's stop using the My_graph variable and write the whole code from the start again for a bar chart
```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
ggplot(data = exam_data, aes(x = Sleep_quality))+
  geom_bar()
```
Because we want to plot a summary of the data (the mean) rather than the raw scores themselves, we have to use a stat.
```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
ggplot(data = exam_data, aes(x = Sleep_quality, y = Exam, fill = Gender))+
  geom_bar(stat = "summary", fun = "mean")


ggplot(data = exam_data, aes(x = Sleep_quality, y = Exam, fill = Gender))+
  geom_bar(stat = "summary", fun = "mean", position = "dodge")
```

The other way to get the same plot that the code above gives, is using the stat_summary function that takes the following general form: `stat_summary(function = x, geom = y)`

```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
ggplot(data = exam_data, aes(x = Sleep_quality, y = Exam, fill = Gender))+
  stat_summary(fun = mean, geom = "bar", position = "dodge")
```

How to combine multiple plots? How to combine multiple plots? We can use the `patchwork` package. A nice tutorial on using this package can be found [here](https://patchwork.data-imaginist.com/articles/patchwork.html)


```{r echo=FALSE, out.width="700px", out.height="350px", fig.cap= "Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations"}
knitr::include_graphics(here('inputs','patchwork_1.jpg'))
```

```{r message=F, warning=F, dpi= 300, fig.height=5, fig.width=6}
p1 = My_graph + geom_point(aes(color = Gender)) + geom_smooth()

p2 = ggplot(data = exam_data, aes(x = Anxiety)) + geom_histogram(bins = 31)

p3 = ggplot(data = exam_data, aes(x = Sleep_quality, y = Exam, fill = Gender))+
  stat_summary(fun = mean, geom = "bar", position = "dodge")

p4 = My_graph + geom_point() + geom_smooth(se = F, method = lm) +
  labs(x = "Exam scores %", y = "Anxiety scores")

combined = p1 + p2+ p3 + p4 + plot_layout(nrow = 4, byrow = F)

combined

p1 | p2 / p3 / p4

p1 | p2 / (p3 / p4)
```


`ggsave()` function, which is a versatile exporting function that can export as PostScript (.eps/.ps), tex (pictex), pdf, jpeg, tiff, png, bmp, svg and wmf (in Windows only). In its basic form, the structure of the function is very simple: `ggsave(filename)`

```{r message=F, warning=F, dpi= 300, fig.height=3, fig.width=5}
ggsave(combined, filename = here("outputs", "combined.png"), dpi=300)
```


Now that we learned the basics of ggplot, let's draw some plot for our experiment data. First, we need to create a dataset with aggregated `depression_score` scores over `group` and `stage`. We will use this dataset for line and bar graphs.

```{r message=F, warning=F, dpi= 300, fig.height=3, fig.width=5}
library(ggsci)

data_exp1_orig <- read_csv(here("cleaned_data","cleaned_data_exp1.csv"))

data_exp1 <- data_exp1_orig%>% 
  #mutate_if(is.character, factor) %>%
  mutate(subject= factor(subject), # convert all characters to factor
         group = factor(group),
         stage = factor(stage))


aggregated_data_exp1 <- data_exp1 %>%
  group_by(stage, group) %>%
  mutate(depression_score = mean(depression_score)) %>%
  ungroup()


barplot_exp1 <- aggregated_data_exp1 %>%
  ggplot(aes(x=stage, y= depression_score, fill=group)) +
  geom_bar(stat = "identity", position= "dodge")+
  labs (x= '', y= "Depression Score") + 
  theme_bw() + 
  scale_fill_jama() 

#ggsave(barplot_exp1, filename = here("outputs","barplot_exp1.png"), dpi=300)


barplot_facet_exp1 <- aggregated_data_exp1 %>%
  ggplot(aes(x=group, y= depression_score, fill=stage)) +
  geom_bar(stat = "identity", position= "dodge")+
  labs (x= '', y= "Depression Score") + 
  theme_bw() + 
  theme(legend.position = "none",
        axis.text=element_text(size=11),
        axis.title = element_text(size = 12)) +
  facet_wrap(~stage, nrow = 1)+
  scale_fill_jco() 

#ggsave(barplot_facet_exp1, filename = here("outputs","barplot_facet_exp1.png"), dpi=300)


lineplot_exp1 <- aggregated_data_exp1 %>%
  ggplot(aes(x=factor(stage), y= depression_score, group= group, color= group)) +
  geom_line(aes(linetype= group)) +
  geom_point(size= 5)+
  labs (x= '', y= "Depression Score") + 
  theme_classic() +
  theme(legend.position = "bottom",
        axis.text=element_text(size=11),
        axis.title = element_text(size = 12)) +
  scale_color_nejm() 

#ggsave(lineplot_exp1, filename = here("outputs","lineplot_exp1.png"), dpi=300)


violinplot_exp1 <- data_exp1 %>%
  ggplot(aes(x=factor(stage), y= depression_score, fill= group)) +
  geom_violin()+
  labs (x= '', y= "Depression Score") + 
  theme_bw() + 
  theme(legend.position = "bottom",
        axis.text=element_text(size=11)) +
  scale_fill_d3() 

#ggsave(violinplot_exp1, filename = here("outputs","violinplot_exp1.png"), dpi=300)


boxplot_exp1 <- data_exp1 %>%
  ggplot(aes(x=factor(stage), y= depression_score, fill= group)) +
  geom_boxplot()+
  #geom_point(position = position_dodge(width=0.75), alpha= .5)+
  labs (x= '', y= "Depression Score") + 
  theme_bw() + 
  theme(legend.position = "bottom",
        axis.text=element_text(size=11)) +
  scale_fill_simpsons() 

#ggsave(boxplot_exp1, filename = here("outputs","boxplot_exp1.png"), dpi=300)


boxplot_facet_exp1 <- data_exp1 %>%
  ggplot(aes(x=factor(stage), y= depression_score, fill= group)) +
  geom_boxplot()+
  labs (x= '', y= "Depression Score") + 
  theme_bw() + 
  theme(legend.position = "bottom",
        axis.text=element_text(size=11),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  facet_wrap(~group)+
  scale_color_simpsons() 

#ggsave(boxplot_facet_exp1, filename = here("outputs","boxplot_facet_exp1.png"), dpi=300)

```

Let's combine our plots:

```{r dpi= 300, fig.height=7, fig.width=9}

combined_plot_exp1 <- barplot_facet_exp1 / (lineplot_exp1+violinplot_exp1+boxplot_exp1)
combined_plot_exp1
```

And here, we save our plots to the `outputs` folder.
```{rmessage=F}
ggsave(combined_plot_exp1, filename = here("outputs","combined_plot_exp1.png"), dpi=300, width = 12)
```

- [A complete guide to ggplot](https://www.cedricscherer.com/2019/08/05/a-ggplot2-tutorial-for-beautiful-plotting-in-r/)

# Descriptive Statistics

```{r echo=FALSE, out.width="700px", out.height="350px", fig.cap= "Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations"}
knitr::include_graphics(here('inputs','not_normal.png'))
```

Now, let's do some descriptive statistics. Now, we can open a new script called `data_analysis.r` and read some datasets. Then we use `skimr` package to describe our data.

```{r message=F, warning=F,}
narcissism_data <- read_csv(here("cleaned_data","narcissism_data.csv"))
narcissism_data %>% skimr::skim()
```

* *Exercise*: Open the dataset called `treatment_data.csv` and do a descriptive analysis:
```{r message=F, warning=F}
treatment_data <- read_csv(here("cleaned_data","treatment_data.csv"))
treatment_data %>% skimr::skim()
```

* *Exercise*: Do the same thing for the `memory_data.csv`.

```{r message=F, warning=F}
memory_data <- read_csv(here("cleaned_data","memory_data.csv"))
memory_data %>% group_by(time) %>%
  skimr::skim()
```


* *Exercise*: For this exercise, we use a dataset of one of my own studies. In this study, we asked participants to guess the physical brightness of some reasoning arguments and then we gave a cognitive ability test. (See the original study [here](https://osf.io/ebxnf/)). Open `ghasemi_brightness_exp4.csv` file and answer to the following questions:

1. How many participants did we test in total?
2. Find out how many male and female we tested.
3. Calculate mean and sd for age and cognitive ability (`cog_ability`).


```{r warning=F, message=F}
ghasemi_data <- read_csv(here("cleaned_data","ghasemi_brightness_exp4.csv"))

ghasemi_data %>% summarise(n = n_distinct(participant)) # number of participants:200

ghasemi_data %>% group_by (participant) %>% filter (row_number()==1) %>% group_by (gender) %>% summarise(n= n()) %>% ungroup() # 183 female, 17 male

ghasemi_data %>% dplyr::select (age, cog_ability) %>% skimr::skim() # mean and sd for age and cognitive ability
```



# Data Analysis

## t-test

Now, we use the treatment data to run three different independent t-tests. Suppose we did an experiment to compare the effectiveness of CBT vs. Psychodynamic therapies in decreasing anxiety, and depression and also in improving life satisfaction:

```{r}
# t.test (indep)
t.test(anxiety~treatment, data= treatment_data)
t.test(depression~treatment, data= treatment_data)
t.test(life_satisfaction~treatment, data= treatment_data)
```

In another experiment, suppose we have created a method to boost memory. Then, we recruit some participants, do a memory pre-test, implement the method, and do a memory post-test, Now, we want to see whether our method have improved participants' memory: 

```{r}
# t.test (paired)
t.test(memory_score~time, data= memory_data, paired= T)
```

Now that we learned about t-test, let's perform this test on our dataset. Is there a difference between groups at the first stage? Ideally, we want participants' depresion score at the first stage to be similar for both groups because we have not started our treatment yet. Previous graphs showed us that depression scores of the CBT and Psychodynamic groups at this stage are pretty close. Let's test that using an **independent t-test** (because we have 2 independent groups):

```{r}
# Is there a difference between groups at the first stage?
data_exp1 %>% 
  group_by(group) %>% 
  filter(stage=='stage1') %>% 
  ungroup () %>%
  t.test(depression_score~group, data = ., paired=FALSE)
```

Now, we wonder if psychotherapy treatments were effective at all, regardless of the treatment method. So, we would like to test if depresion score at the forth stage are lower than scores at the stage 2? Since a pair of score at stage 2 and stage 4 is coming from a same person, we use **paired t-test**.

```{r}
# Is there a difference between ratings of stage2 and stage4?
data_exp1 %>% 
  filter(stage=='stage2' | stage=='stage4') %>% 
  ungroup () %>%
  t.test(depression_score~stage, data = ., paired=TRUE)
```


* *Exercise*: John et al. (2019) investigated the consequences of backing down (changing one's mind in lights of evidence)and how other people view someone who change their mind. In their second experiments, they presented participants either with a person who changes their mind or a person who refuses to back down. Then, they asked participants to rate how intelligent and confident the person is (See the original study [here](https://www.hbs.edu/faculty/Publication%20Files/John%20et%20al%20-%20self-presentational%20consequences_b85b2c43-a5b5-474c-9e2c-e9853b10727e.pdf)). They reported that: 

> "Relative to the entrepreneur who did not back down, participants judged the entrepreneur who backed down as more intelligent (M_backed_down=5.13 out of 7, SD=1.09; M_did_not_back_down=3.97, SD=1.54; t(271.12)=−7.59, p < .001) but less confident (M_backed_down=4.50 out of 7, SD=1.36; M_did_not_back_down=5.65, SD=1.10; t(291.01)=8.08, p < .001).".

Open the `john_backdown_exp2.csv` file and try to reproduce their results. Run two separate independent t-test, one with `intelligent` as the dependent variable and one with `confident` as the dependent variable. For both t-test, use `back_down` as the between-subject independent variable.

```{r message=F, warning=F}
john_data <- read_csv(here("cleaned_data","john_backdown_exp2.csv"))


t.test(intelligent~back_down, data = john_data, paired=FALSE)
t.test(confident~back_down, data = john_data, paired=FALSE)
```


## Analysis of Variance (ANOVA)

Now, let's analysis our main experiment data: Do participants in the CBT group show better outcome compared to participants in the Psychodynamic group? Suppose we believe that participants should show lower depression after 5 or 10 sessions of both psychotherapy treatments and this decrease should be more pronounced for CBT than psychodynamic psychotherapy. If this is the case. we expect an interaction in the traditional **Analysis of Variance (AONVA)** test.

```{r message=F, warning=F}
aov_m1 <- aov_car (depression_score ~ group*stage +
                     Error(subject/stage), data = data_exp1)  
```

```{r echo=F}
knitr::kable(nice(aov_m1)) %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = T)
```

As you can see, we found a significant main effect of stage and a significant group by stage interaction. We can use the `emmeans` package to do post-hoc tests.

```{r warning=F, message=F}
# main effect of stage
emmeans(aov_m1, 'stage')
pairs(emmeans(aov_m1, 'stage'), adjust= 'holm')
```


```{r warning=F, message=F}
# group by stage interaction
emmeans(aov_m1, "group", by= "stage")
update(pairs(emmeans(aov_m1, "group", by= "stage")), by = NULL, adjust = "holm") 
```

You can use the `afex_plot` function from afex to create beautiful plots. Those plots interacts nicely with ggplot:
```{r message=F, warning=F, dpi= 300}
afex_plot(aov_m1, x = "stage", trace = "group", error='between',
          line_arg = list(size=1),
          point_arg = list(size=3.5),
          data_arg = list(size= 1, color= 'grey', width=.4),
          data_geom = geom_boxplot,
          mapping = c("linetype", "shape", "fill"),
          legend_title = "Group") +
  labs(y = "Depression Score", x = "") +
  theme_bw()+ # remove the grey background and grid
  theme(axis.text=element_text(size=13),
        axis.title = element_text(size = 13),
        legend.text=element_text(size=13),
        legend.title=element_text(size=13),
        legend.position='bottom',
        legend.key.size = unit(1, "cm"),
        legend.background = element_rect(colour = 'black', fill = 'white', linetype='solid'))+
  scale_color_simpsons() +
  scale_fill_simpsons()
```


If you are interested in this topic, check out this nice tutorial about [using afex to run ANOVA](https://cran.r-project.org/web/packages/afex/vignettes/afex_anova_example.html), and also this interesting tutorial on the [emmeans package](https://aosmith.rbind.io/2019/03/25/getting-started-with-emmeans/).

* *Exercise*: Rotello et al. (2018) investigated the association between the race (White vs. Black faces) and the gun-tool judgments. In their first experiments, they presented participants with 16 White male faces and 16 Black male faces, and following that 8 images of guns and 8 images of tools. They asked participants to judge if the object is a tool or a gun by pressing keyboard buttons. Then, they ran an ANOVA to see if participants' gun responses are higher for any of the races. So, they included prime race (Black, White) and target identity (gun, tool) as independent variables and participants' gun responses as dependent variable into their linear model (See the original study [here](https://online.ucpress.edu/collabra/article/4/1/32/112986/The-Shape-of-ROC-Curves-in-Shooter-Tasks)). They found that: 

> "Participants made more gun responses to guns than to tools, F(1,45) = 53243, p < 0.0001, η2g = 0.998. However, the race of the prime face did not matter, F(1,45) = 0.287, p > 0.59, η2g = 0.001, nor was there an interaction of prime race with target object, F(1,45) = 0.022, p > 0.88, η2g = 0.000)".

Open the `rotello_shooter_exp1.csv` file and try to reproduce their results. Run an ANOVA (type III) with `resp` as the dependent variable and target, prime, and their interaction as independent variables.


```{r message=F, warning=F}
# load the general data file
rotello_data <- read_csv(here("cleaned_data","rotello_shooter_exp1.csv"))

# ANOVA
rotello_aov <- aov_car (resp ~ target*prime +
           Error(subject/target*prime), data = rotello_data)
```

```{r echo=F}
knitr::kable(nice(rotello_aov)) %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = T)
```



## Correlation

Here, we want to check the correlation between variables on the `narcissism_data`. First, we need to remove `subject` column because it is not numeric:
```{r message=F, warning=F}
narcissism_data_cor <- narcissism_data %>%
  select(-subject)
```

```{r message=F, eval=F, fig.align='center', dpi=300}

#-- Base R:
cor(narcissism_data_cor, method = "pearson",  use = "complete.obs")

#-- Psych library:
psych::pairs.panels(narcissism_data_cor, method = "pearson", hist.col = "#00AFBB", density = T, ellipses = F, stars = T)

#-- Correlation library:
# install.packages("devtools")
# devtools::install_github("easystats/correlation")
#library("correlation")
correlation::correlation(narcissism_data_cor) %>% summary()

#-- apaTables library:
narcissism_data_cor %>% 
  apaTables::apa.cor.table(filename="./outputs/CorMatrix.doc", show.conf.interval=T)
```

```{r message=F, echo=F, fig.align='center', dpi=300}

#-- Base R:
cor(narcissism_data_cor, method = "pearson",  use = "complete.obs")%>%
  knitr::kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = F)

#-- Psych library:
psych::pairs.panels(narcissism_data_cor, method = "pearson", hist.col = "#00AFBB", density = T, ellipses = F, stars = T)

#-- Correlation library:
# install.packages("devtools")
# devtools::install_github("easystats/correlation")
#library("correlation")
correlation::correlation(narcissism_data_cor) %>% summary()%>%
  knitr::kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = F)

```


* *Exercise*: Pennycook et al. (2020) investigated the relationship between actively open-minded thinking style about evidence (AOT-E) and different political, scientific, and religious beliefs (see the original paper [here](https://psyarxiv.com/a7k96)). In their first experiment, they calculated the correlation of AOTE and scientific beliefs items (global warming, evolution, etc.) and they found the following results:

```{r echo=FALSE, out.width="700px", out.height="350px", fig.cap= "adapted from [Pennycook et al. (2020)](https://psyarxiv.com/a7k96)"}
knitr::include_graphics(here('inputs','pennycook_corr.png'))
```

Open the `pennycook_aote_exp1.csv` file and try to reproduce their results by creating the same correlation matrix.

```{r message=F, eval=F}
pennycook_data <- read_csv(here("cleaned_data","pennycook_aote_exp1.csv")) 


#---------- Base R:
cor(pennycook_data, method = "pearson",  use = "complete.obs")

#---------- Psych library:
pennycook_data %>% 
  psych::pairs.panels(method = "pearson", hist.col = "#00AFBB", density = T, ellipses = F, stars = T)

#---------- Correlation library:
correlation::correlation(pennycook_data) %>% summary()

#---------- apaTables library:
pennycook_data %>% 
  apaTables::apa.cor.table(filename="./outputs/CorMatrix.doc", show.conf.interval=T)
```


```{r message=F, eval=T, echo=F, fig.align='center', dpi=300}
pennycook_data <- read_csv(here("cleaned_data","pennycook_aote_exp1.csv")) %>%
  clean_names()

correlation::correlation(pennycook_data) %>% summary() %>%
  knitr::kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = F)%>%
  scroll_box(width = "780px")

```


## Linear Regression

Here, we do single and multiple linear regreassion on the `narcissism_data`:

```{r}
m1 <- lm(mental_health~narcissism, data= narcissism_data)
```

```{r message=F, eval=T, echo=F, fig.align='center', dpi=300}
broom::tidy(m1)%>%
  knitr::kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = T)
```

```{r}
m2 <- lm(mental_health~narcissism+psychopathy, data= narcissism_data)
```

```{r message=F, eval=T, echo=F, fig.align='center', dpi=300}
broom::tidy(m2)%>%
  knitr::kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = T)
```

* *Exercise*: Trémolière and Djeriouat (2020) examined the role of *cognitive reflection* and *belief in science* in climate change skepticism. In their first study, they revealed that cognitive reflection and belief in science negetively predicted climate change skepticism even after controlling for demographic and cognitive ability variables (see the original paper [here](https://psyarxiv.com/vp8k6/)). 

```{r echo=FALSE, out.width="700px", out.height="350px", fig.cap= "adapted from [Trémolière and Djeriouat (2020)](https://psyarxiv.com/vp8k6/)"}
knitr::include_graphics(here('inputs','tremoliere_reg.png'))
```

Open the `tremoliere_data_exp1.csv` file and try to reproduce their results by running a multiple linear regression. Enter age, gender, education, belief in science, literacy, numeracy (Numtotal), and cognitive reflection as predictors and enter climate change skepticism (climato) as the outcome variable.

```{r message=F}
Tremoliere_data <- read_csv(here("cleaned_data","tremoliere_data_exp1.csv"))

Tremoliere_reg=lm(Climato ~ Age+ Gender+ Education+ BeliefInSciencetotal+ Literacy+ Numtotal+ CognitiveReflection,
                    data=Tremoliere_data)
```


```{r message=F, eval=T, echo=F, fig.align='center', dpi=300}
broom::tidy(Tremoliere_reg)%>%
  knitr::kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = T)

glance(Tremoliere_reg)%>%
  knitr::kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = F)%>%
  scroll_box(width = "780px")
```


# Rmarkdown

To be completed...


```{r echo=FALSE, out.width="700px", out.height="350px", fig.cap= "Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations"}
knitr::include_graphics(here('inputs','rmarkdown_wizards.png'))
```


```{r echo=FALSE, out.width="700px", out.height="350px", fig.cap= "Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations"}
knitr::include_graphics(here('inputs','reproducibility_court.png'))
```

# References

- Ghasemi, O., Handley, S., & Howarth, S. (2020). The Bright Homunculus in our Head: Individual Differences in Intuitive Sensitivity to Logical Validity.

- John, L. K., Jeong, M., Gino, F., & Huang, L. (2019). The self-presentational consequences of upholding one’s stance in spite of the evidence. Organizational Behavior and Human Decision Processes, 154, 1-14.

- Pennycook, G., Cheyne, J. A., Koehler, D. J., & Fugelsang, J. A. (2020). On the belief that beliefs should change according to evidence: Implications for conspiratorial, moral, paranormal, political, religious, and science beliefs. Judgment and Decision Making, 15(4), 476.

- Rotello, C. M., Kelly, L. J., Heit, E., Vazire, S., & Vul, E. (2018). The Shape of ROC Curves in Shooter Tasks: Implications for Best Practices in Analysis. Collabra: Psychology, 4(1).

- Trémolière, B., & Djeriouat, H. (2020). Don’t you see that its cold! Exploring the roles of cognitive reflection, climate science literacy, illusion of knowledge, and political orientation in climate change skepticism.

- Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1-23.